Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

Authors

  • Anthony Bagnall
  • Simon Moxon
  • David J. Studholme
  • Vincent Moulton
Abstract

The class of molecules called short RNAs (sRNAs) is known to play a key role in gene regulation. They are typically sequences of between 21 and 25 nucleotides. The identification, clustering and classification of sRNA has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is somehow unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule-based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than the methods used to obtain the results. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining problems. Both methods are based on treating the occurrences on a chromosome, or “hit count” data, as a time series, then running a sliding window along a chromosome and measuring unusualness. This formulation means we can treat finding unusual areas of candidate RNA activity as a form of time series anomaly detection. The first set of approaches is model-based. We specify a null hypothesis distribution for not being an sRNA, then estimate the p-values along the chromosome. The second approach is instance-based. We identify some typical shapes from known sRNA, then use dynamic time warping and Fourier-transform-based distances to measure how closely the candidate series matches. We demonstrate that these methods can find known sRNA on Arabidopsis thaliana chromosomes and illustrate the benefits of the added information provided by these algorithms.
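The two approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the null distribution, so a Poisson null with a known background hit rate is assumed here purely for illustration. The dynamic time warping routine is the textbook recurrence with a squared-difference cost, which is also an assumption. The names `poisson_sf`, `window_p_values`, `dtw_distance`, and the parameters `width` and `rate` are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) (assumed null model)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_p_values(hits, width, rate):
    """Slide a fixed-width window along the hit counts and return,
    for each window, the p-value of its total under the Poisson null."""
    if width < 1 or width > len(hits):
        raise ValueError("width must be between 1 and len(hits)")
    lam = rate * width               # expected hits per window under the null
    total = sum(hits[:width])
    p_values = [poisson_sf(total, lam)]
    for start in range(1, len(hits) - width + 1):
        # Update the running total: add the new right edge, drop the old left edge.
        total += hits[start + width - 1] - hits[start - 1]
        p_values.append(poisson_sf(total, lam))
    return p_values


def dtw_distance(a, b):
    """Classic dynamic time warping distance with squared-difference cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)    # row 0 of the cumulative cost matrix
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            cur.append(cost + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


# Toy hit-count series with a burst in the middle (made-up data).
hits = [0, 0, 0, 5, 6, 4, 0, 0]
p = window_p_values(hits, width=3, rate=0.5)
most_unusual = p.index(min(p))       # start position of the least likely window
```

In this toy series the window with the smallest p-value is the one that covers the burst. For the instance-based approach, `dtw_distance` would be called with a candidate window and a template shape taken from a known sRNA; a smaller distance means a closer match.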


Related resources

Functional analysis of glycin rich- RNA binding protein, a suppressor of trehalose-6-phosphate mediating growth arrest in Arabidopsis thaliana

Metabolism of the alpha-1,1 glucose disaccharide, trehalose, is indispensable in plants. In the Murashige and Skoog (MS) medium, trehalose inhibits plant growth and allocation of carbon to roots. A suppressor of trehalose-6-phosphate (T6P) mediated growth arrest, GR-RBP2, is characterized in more detail. Phylogenetic analysis revealed that GR-RBP2 is a protein of likely prokaryot...


Differential Expression of Arabidopsis thaliana Acid Phosphatases in Response to Abiotic Stresses

The objective of this research is to identify Arabidopsis thaliana genes encoding acid phosphatases induced by phosphate starvation. Multiple alignments of eukaryotic acid phosphatase amino acid sequences led to the classification of these proteins into four groups including purple acid phosphatases (PAPs). Specific primers were degenerated and designed based on conserved sequences of PAPs isol...


Synthesis and amplification of cDNA from the mRNA of the gene encoding the P5CS enzyme in Arabidopsis thaliana

When plants are exposed to drought and salinity, they face osmotic stress, and to cope with it they increase the accumulation of osmolytes such as proline, glycine betaine, or other similar compounds. The biosynthetic pathways of these osmolytes involve enzymes, some of which are considered key, and by altering the expression of the genes that encode these enzymes, the final product, which is the desired osmolyte, can be ...


Yeast Two Hybrid cDNA Screening of Arabidopsis thaliana for SETH4 Protein Interaction

The SETH4 coding sequence, 2013 bp long, is a member of a gene family expressed in gametophytic tissues of Arabidopsis thaliana. This fragment was PCR amplified using Kod Hi Fi DNA polymerase enzyme. The fragment was cloned into the pGBKT7 bait vector, and transformed E. coli DH5α cells containing the vector were selected on LB medium containing kanamycin. Finally, the pGBKT7-SETH4 bait was transformed into yeast ...


Algorithms for Segmenting Time Series

As with most computer science problems, representation of the data is the key to efficient and effective solutions. Piecewise linear representation has been used for the representation of the data. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain...



Publication date: 2008